Mining spatial trajectories aims to extract non-explicit information from spatial trajectory data that can be
organized as temporally ordered locations, such as taxi GPS logs and Twitter check-ins (Zheng, 2015). The field has
been revolutionizing the traditional means of collecting and processing geospatial information for mapping and many
other real-world applications (Bock, Liu, & Sester, 2016; Yang & Meng, 2015). One of the mining tasks is to label
the individual points of a trajectory with the states being queried, so that the physical measurements can be better
interpreted (Yang, 2016). For example, map matching assigns each data point in the location sequence to the road
segment on which the moving object traveled, while location-based activity recognition identifies the most probable
activity (e.g., at home, at work, at a bar) associated with each location in the trajectory data. Label assignment
in these tasks is challenging, especially when the measurements are noisy and when the semantic correspondences
between data points and labels are non-exclusive.

Probabilistic methods are popular for these labeling tasks because they often yield higher labeling accuracy. In
previous work (Yang, 2016), we addressed the labeling of spatial trajectories by means of map matching. We developed
a probabilistic model with conditional random fields, which computes the maximum likelihood of the trajectory data
given label assignments based on a set of weighted features that capture the correspondences between road segments
and location observations of the moving objects. On a small taxi GPS dataset, our model outperformed the
state-of-the-art approaches in terms of both accuracy and reliability when matching taxi GPS trajectories with a low
sampling rate to OpenStreetMap road data. Furthermore, our model employed an optimization process to select the most
relevant features (i.e., features that improve the model likelihood for map matching, see Fig. 1) for matching sparse
and noisy GPS trajectories, some of which revealed valuable cues for understanding drivers’ behavior at road
intersections.
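
As a rough illustration of how weighted features can drive the label assignment in a chain-structured model, the
following Python sketch scores candidate road segments with two features (distance error and number of left turns)
and decodes the best label sequence with the Viterbi algorithm. The weights, the feature values and the function
name viterbi are assumptions made for this example; they are not the learned weights or the implementation of
Yang (2016).

```python
from typing import List, Sequence


def viterbi(unary: Sequence[Sequence[float]],
            pairwise: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Return the highest-scoring label index for every step of a chain.

    unary[t][j]       : score of candidate j at GPS point t
    pairwise[t][i][j] : score of moving from candidate i at point t
                        to candidate j at point t + 1
    """
    best = list(unary[0])          # best score of any path ending in each candidate
    back: List[List[int]] = []     # back-pointers for every later step
    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for j, u in enumerate(unary[t]):
            scores = [best[i] + pairwise[t - 1][i][j] for i in range(len(best))]
            i_star = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[i_star] + u)
            pointers.append(i_star)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final candidate.
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


# Hypothetical weights (illustration only, not learned values).
W_DIST_ERR = -0.05    # per metre between GPS point and candidate segment
W_LEFT_TURN = -1.0    # per left turn on the path between two candidates

# Toy input: 3 GPS points, 2 candidate road segments each.
dist_err = [[5.0, 40.0], [30.0, 8.0], [6.0, 35.0]]
left_turns = [[[0, 2], [1, 0]],
              [[0, 1], [3, 0]]]

unary = [[W_DIST_ERR * d for d in row] for row in dist_err]
pairwise = [[[W_LEFT_TURN * k for k in row] for row in step] for step in left_turns]
labels = viterbi(unary, pairwise)
```

In this toy input the second GPS point lies closer to candidate 1, yet the left-turn penalty keeps the decoded path
on candidate 0 throughout, which is the kind of trade-off between features that the weights encode.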

However, some interesting questions remain unanswered by our current results: 1) How should the features learned
from sample taxi routes be interpreted? 2) Does the overall confidence (i.e., the likelihood) of the model in a
route prediction indicate a correct matching result? 3) How does the routing preference of an individual taxi driver
deviate from the collective knowledge mined from massive trajectory data?
Fig. 1. Features learned for map matching of low sampling rate GPS data. The magnitude of each weight indicates the
relevance of the feature to the task. Among all the features, distance error (DistErr), number of left turns
(#LeftTurn), number of links in the path (#Lnk) and number of different road classes in the path (#RoadClass) are
the most relevant ones. (Source: Yang, 2016)
Fig. 2. Recovered paths between GPS data points with a sampling rate of 120 s. Green paths are ground truths and red
ones are results generated by our map matching method. The comparisons illustrate cases in which the fastest paths
are less preferred by the taxi drivers: (a) path with fewer turns, (b)-(c) paths skipping a traffic crossing,
(d)-(e) paths with smooth transitions, (f) path with fewer lane transitions.
To investigate the aforementioned research questions, this paper scales up the original map matching work. Firstly,
the ground truth data are prepared in a semi-automated manner. Since ground truth data are often not available for
massive trajectory datasets, a trajectory data management system is developed to match trajectory data with a high
sampling rate using a Hidden Markov Model (Newson & Krumm, 2009), followed by a carefully designed manual visual
validation to ensure the quality of the ground truth data. Secondly, visual analytics approaches (Ding, 2016) are
proposed to explore spatio-temporal patterns of individual routing preferences (Fig. 2 shows that drivers do not
take the shortest or fastest path in some cases), namely when individual drivers deviate from the fastest route and
how often they make these decisions. Thirdly, we train our chain-structured conditional random fields with these
labeled data. Experiments, e.g., an examination of the turn-by-turn patterns at different road intersections, are
performed to interpret the features against daily driving experience.
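
For readers unfamiliar with the Hidden Markov Model formulation of Newson & Krumm (2009) used to prepare the ground
truth, the sketch below shows its two building blocks as they are commonly described: a Gaussian emission
probability on the distance between a GPS point and a candidate road position, and an exponential transition
probability on the difference between the straight-line distance and the route distance of two consecutive points.
The function names and the parameter values (sigma_m, beta_m) are assumptions for illustration, not the settings of
our trajectory data management system.

```python
import math


def emission_prob(gps_to_road_m: float, sigma_m: float) -> float:
    """Gaussian likelihood of observing a GPS point gps_to_road_m metres
    away from a candidate position on a road segment."""
    z = gps_to_road_m / sigma_m
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma_m)


def transition_prob(straight_line_m: float, route_m: float, beta_m: float) -> float:
    """Exponential likelihood of moving between two candidates, based on how
    much the route distance differs from the straight-line distance between
    the two consecutive GPS points."""
    d = abs(straight_line_m - route_m)
    return math.exp(-d / beta_m) / beta_m


# Hypothetical parameter values, chosen only for this example.
SIGMA_M = 5.0
BETA_M = 10.0

near = emission_prob(2.0, SIGMA_M)    # candidate close to the GPS point
far = emission_prob(25.0, SIGMA_M)    # candidate far from the GPS point
direct = transition_prob(100.0, 105.0, BETA_M)   # route about as long as the straight line
detour = transition_prob(100.0, 300.0, BETA_M)   # route requiring a long detour
```

With a high sampling rate, consecutive points are close together, so plausible routes differ little from the
straight-line distance and the transition term strongly penalizes detours.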

Based on this extensive study, a number of conclusions, including new insights, can be drawn: 1) The quality of
labeled data has a significant impact on the feature learning results and the performance of the probabilistic
models; 2) The learned features should be interpreted with care when they are used to understand routing preferences.